Calibrated Fairness in Bandits
We study fairness within the stochastic, \emph{multi-armed bandit} (MAB)
decision making framework. We adapt the fairness framework of "treating similar
individuals similarly" to this setting. Here, an `individual' corresponds to an
arm and two arms are `similar' if they have a similar quality distribution.
First, we adopt a {\em smoothness constraint} that if two arms have a similar
quality distribution then the probability of selecting each arm should be
similar. In addition, we define the {\em fairness regret}, which corresponds to
the degree to which an algorithm is not calibrated, where perfect calibration
requires that the probability of selecting an arm is equal to the probability
with which the arm has the best quality realization. We show that a variation
on Thompson sampling satisfies smooth fairness for total variation distance,
and give a bound on the fairness regret. This complements
prior work, which protects an on-average better arm from being less favored. We
also explain how to extend our algorithm to the dueling bandit setting. Comment: To be presented at the FAT-ML'17 workshop
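The abstract does not spell out the paper's variation on Thompson sampling, so as a minimal sketch, here is standard Beta-Bernoulli Thompson sampling, which exhibits the calibration property described above: the probability of pulling an arm tracks the posterior probability that it has the best quality realization. The Bernoulli reward model, function name, and parameters are assumptions for illustration, not the paper's algorithm.

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling. Each round, sample a quality
    estimate from every arm's Beta posterior and pull the argmax; the
    pull probability thus equals the posterior probability that the arm
    is best, which is the calibration idea in the abstract."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1] * k   # Beta posterior parameters: prior + successes
    beta = [1] * k    # Beta posterior parameters: prior + failures
    pulls = [0] * k
    for _ in range(horizon):
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# Two Bernoulli arms with means 0.3 and 0.7 (toy instance).
pulls = thompson_sampling([0.3, 0.7], horizon=2000)
```

Early on, the posteriors of similar arms overlap heavily, so similar arms are pulled with similar probability; as evidence accumulates, pulls concentrate on the better arm.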
Sequential Principal-Agent Problems with Communication: Efficient Computation and Learning
We study a sequential decision making problem between a principal and an
agent with incomplete information on both sides. In this model, the principal
and the agent interact in a stochastic environment, and each is privy to
observations about the state not available to the other. The principal has the
power of commitment, both to elicit information from the agent and to provide
signals about her own information. The principal and the agent communicate
their signals to each other, and select their actions independently based on
this communication. Each player receives a payoff based on the state and their
joint actions, and the environment moves to a new state. The interaction
continues over a finite time horizon, and both players act to optimize their
own total payoffs over the horizon. Our model encompasses as special cases
stochastic games of incomplete information and POMDPs, as well as sequential
Bayesian persuasion and mechanism design problems. We study both computation of
optimal policies and learning in our setting. While the general problems are
computationally intractable, we study algorithmic solutions under a conditional
independence assumption on the underlying state-observation distributions. We
present a polynomial-time algorithm to compute the principal's optimal policy
up to an additive approximation. Additionally, we show an efficient learning
algorithm in the case where the transition probabilities are not known
beforehand. The algorithm guarantees sublinear regret for both players.
Markov Decision Processes with Time-Varying Geometric Discounting
Canonical models of Markov decision processes (MDPs) usually consider
geometric discounting based on a constant discount factor. While this standard
modeling approach has led to many elegant results, some recent studies indicate
the necessity of modeling time-varying discounting in certain applications.
This paper studies a model of infinite-horizon MDPs with time-varying discount
factors. We take a game-theoretic perspective -- whereby each time step is
treated as an independent decision maker with their own (fixed) discount factor
-- and we study the subgame perfect equilibrium (SPE) of the resulting game as
well as the related algorithmic problems. We present a constructive proof of
the existence of an SPE and demonstrate the EXPTIME-hardness of computing an
SPE. We also turn to the approximate notion of $\epsilon$-SPE and show that an
$\epsilon$-SPE exists under milder assumptions. An algorithm is presented to
compute an $\epsilon$-SPE, for which an upper bound on the time complexity, as a
function of the convergence property of the time-varying discount factor, is
provided. Comment: 24 pages, 3 figures
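The SPE construction itself is not given in the abstract; as a loose illustration of time-varying discounting, here is a toy finite-horizon backward induction in which the step-$t$ planner discounts its continuation value by its own factor `gammas[t]`. This is a simplification of the paper's game-theoretic model (which concerns infinite horizons and subgame perfect equilibria); the MDP encoding and all names are assumptions.

```python
def backward_induction(P, R, gammas):
    """Backward induction for a finite-horizon MDP whose discount factor
    varies with time. P[s][a] maps next_state -> probability, R[s][a] is
    the immediate reward, and gammas[t] is the discount applied at step t.
    Returns the step-0 value function and one greedy policy per step."""
    n_states = len(P)
    V = [0.0] * n_states          # value beyond the horizon
    policies = []
    for t in reversed(range(len(gammas))):
        newV, pi = [0.0] * n_states, [0] * n_states
        for s in range(n_states):
            best, best_a = float("-inf"), 0
            for a in range(len(P[s])):
                q = R[s][a] + gammas[t] * sum(
                    p * V[s2] for s2, p in P[s][a].items())
                if q > best:
                    best, best_a = q, a
            newV[s], pi[s] = best, best_a
        V = newV
        policies.append(pi)
    policies.reverse()
    return V, policies

# Toy instance: one state, two actions (rewards 1.0 and 0.0),
# discount factors 0.5 at both steps.
P = [[{0: 1.0}, {0: 1.0}]]
R = [[1.0, 0.0]]
V, policies = backward_induction(P, R, gammas=[0.5, 0.5])
```

The game-theoretic reading is that each step's greedy maximization is a best response given the later steps' (already computed) behavior, which is the intuition behind treating each time step as an independent decision maker.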
A randomized controlled trial of vaginal misoprostol tablet and intracervical dinoprostone gel in labor induction of women with prolonged pregnancies
Background: The objective of the study was to compare the efficacy of vaginal misoprostol and intracervical dinoprostone gel for induction of labor in women with an unfavorable cervix beyond 41 weeks (287 days) of gestation. Methods: This randomized controlled trial was performed at a teaching hospital between January 2011 and December 2012. 192 women with singleton uncomplicated pregnancies, no previous uterine scar, and no spontaneous labor by the 288th day of gestation were randomized. Misoprostol (25 mcg tablet) was placed in the posterior vaginal fornix four hourly, to a maximum of six doses, or dinoprostone (0.5 mg gel) was instilled intracervically six hourly, to a maximum of three doses. Oxytocin was administered if needed. Primary outcomes: induction-delivery interval (IDI), with the incidence of delivery within 12 hours and 24 hours, and mode of delivery (vaginal or caesarean section). Secondary outcomes: maternal side effects and neonatal outcome. For statistical analysis, the chi-square test, Student's t-test and P-value determination were used. Results: The mean IDI was shorter in the misoprostol group compared to the dinoprostone group (p<0.05). Adverse neonatal outcome (5-minute Apgar score <7) did not differ significantly between the groups (p>0.05). Conclusions: Vaginal misoprostol tablet is a safe and more effective method of induction of labour when compared with intracervical dinoprostone gel in prolonged pregnancies.
Adversarial blocking bandits
We consider a general adversarial multi-armed blocking bandit setting where each played arm can be blocked (unavailable) for some time periods and the reward per arm is given at each time period adversarially without obeying any distribution. The setting models scenarios of allocating scarce limited supplies (e.g., arms) where the supplies replenish and can be reused only after certain time periods. We first show that, in the optimization setting, when the blocking durations and rewards are known in advance, finding an optimal policy (i.e., determining which arm to play in each round) that maximises the cumulative reward is strongly NP-hard, eliminating the possibility of a fully polynomial-time approximation scheme (FPTAS) for the problem unless P = NP. To complement our result, we show that a greedy algorithm that plays the best available arm at each round provides an approximation guarantee that depends on the blocking durations and the path variance of the rewards. In the bandit setting, when the blocking durations and rewards are not known, we design two algorithms, RGA and RGA-META, for the case of bounded durations and path variation. In particular, when the variation budget B_T is known in advance, RGA can achieve O(\sqrt{T(2\tilde{D}+K)B_{T}}) dynamic approximate regret. On the other hand, when B_T is not known, we show that the dynamic approximate regret of RGA-META is at most O((K+\tilde{D})^{1/4}\tilde{B}^{1/2}T^{3/4}), where \tilde{B} is the maximal path variation budget within each batch of RGA-META (which is provably of order o(\sqrt{T})). We also prove that if either the variation budget or the maximal blocking duration is unbounded, the approximate regret will be at least \Theta(T). We also show that the regret upper bound of RGA is tight if the blocking durations are bounded above by an order of O(1).
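The greedy baseline described above (play the best available arm each round) can be sketched as follows. The reward/duration encoding, names, and the convention that duration 1 means no blocking are assumptions for illustration; the abstract's approximation guarantee is not computed here.

```python
def greedy_blocking(rewards, durations):
    """Greedy baseline for blocking bandits with known rewards: each
    round, play the available arm with the highest current reward.
    An arm played at round t with duration d becomes available again
    at round t + d (so d = 1 means no blocking beyond the play).
    rewards[t][i] is the adversarially chosen reward of arm i at round t."""
    T, K = len(rewards), len(durations)
    available_at = [0] * K   # earliest round each arm can next be played
    total, plays = 0.0, []
    for t in range(T):
        avail = [i for i in range(K) if available_at[i] <= t]
        if not avail:        # every arm is blocked this round
            plays.append(None)
            continue
        arm = max(avail, key=lambda i: rewards[t][i])
        total += rewards[t][arm]
        available_at[arm] = t + durations[arm]
        plays.append(arm)
    return total, plays

# Arm 0 pays 1.0 but blocks itself for one extra round; arm 1 pays 0.5
# and never blocks. Over 3 rounds the greedy play alternates 0, 1, 0.
total, plays = greedy_blocking([[1.0, 0.5]] * 3, durations=[2, 1])
```

Myopically taking the best available arm can be suboptimal when a high-reward arm has a long blocking duration, which is exactly why the approximation factor in the abstract depends on the blocking durations.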